
    Time and Frequency Pruning for Speaker Identification

    This work is an attempt to refine decisions in speaker identification. A test utterance is divided into multiple time-frequency blocks on which a normalized likelihood score is calculated. Instead of averaging the block likelihoods over the whole test utterance, some of them are rejected (pruning) and the final score is computed from a limited number of time-frequency blocks. The results obtained in the special case of time pruning led the authors to experiment with a joint time and frequency pruning approach. The optimal percentage of pruned blocks is learned on a tuning data set with a minimum identification error criterion. Validation of the time-frequency pruning process on 567 speakers leads to a significant error rate reduction (up to a 41% reduction on TIMIT) for short training and test durations.

    Introduction. Mono-gaussian models for speaker recognition have largely been replaced by Gaussian Mixture Models (GMM), which are dedicated to modeling smaller clusters of speech. Gaussian mixture modeling can be seen as a cooperation of models, since the gaussian mixture density is a weighted linear combination of uni-modal gaussian densities. The work presented here is instead concerned with a competition of models: different mono-gaussian models (corresponding to different frequency subbands) are applied to the test signal and the decision is made with the best or the N-best model scores. More precisely, a test utterance is divided into time-frequency blocks, each of them corresponding to a particular frequency subband and a particular time segment. During the recognition phase, the block scores are accumulated over the whole test utterance to compute a global score and take a final decision. In this work, we investigate accumulation with a hard-threshold approach: some block scores are eliminated (pruning) and the final decision is taken with a subset of these scores. This approach should be robust to time-frequency localized noise, since the least reliable time-frequency blocks can be removed. Even in the case of clean speech, some blocks of a speaker's test utterance can simply be more similar to another speaker's model than to the target speaker's model itself. Removing these error-prone blocks should lead to a more robust decision. In the following, a formalism is first proposed to describe our block-based speaker recognition system. The potential of this approach is shown with a special case of the formalism: time pruning. Experiments intended to find the optimal percentage of pruned blocks are then described, and the optimal parameters (percentage of blocks pruned) are validated on the TIMIT and NTIMIT databases. Finally, we summarize our main results and outline the potential advantages of the time-frequency pruning procedure.

    Formalism. Mono-gaussian 'segmental' modeling. Let $\{x_t\}_{1 \le t \le M}$ be a sequence of $M$ vectors resulting from the $p$-dimensional acoustic analysis of a speech signal uttered by speaker X. These vectors are summarized by the mean vector $\bar{x}$ and the covariance matrix $X$:

    $\bar{x} = \frac{1}{M} \sum_{t=1}^{M} x_t, \qquad X = \frac{1}{M} \sum_{t=1}^{M} (x_t - \bar{x})(x_t - \bar{x})^{T}$
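
    As a rough illustration of the block-pruning idea described above (not the authors' exact implementation), the sketch below assumes that a normalized likelihood score has already been computed for every time segment and frequency subband of the test utterance, drops a fraction of the worst-scoring blocks, and averages the rest; the speaker model with the best pruned score is selected. The function names, the array layout and the fixed pruning fraction are illustrative assumptions; in the paper the pruning percentage is learned on a tuning set with the minimum identification error criterion.

```python
import numpy as np

def pruned_score(block_scores: np.ndarray, prune_fraction: float) -> float:
    """Average the per-block normalized likelihood scores after discarding
    the lowest-scoring fraction of time-frequency blocks (pruning)."""
    flat = np.sort(block_scores.ravel())           # all time-frequency block scores
    keep = flat[int(prune_fraction * flat.size):]  # drop the worst blocks
    return float(keep.mean())

def identify(block_scores_per_speaker: dict, prune_fraction: float = 0.25) -> str:
    """Return the speaker whose model gives the best pruned global score.
    `block_scores_per_speaker` maps a speaker id to a (time_segments x subbands)
    array of normalized block likelihoods for the test utterance."""
    return max(block_scores_per_speaker,
               key=lambda spk: pruned_score(block_scores_per_speaker[spk], prune_fraction))

# toy usage: two speaker models, 4 time segments x 3 frequency subbands
rng = np.random.default_rng(0)
scores = {"spk_A": rng.normal(0.0, 1.0, (4, 3)), "spk_B": rng.normal(0.5, 1.0, (4, 3))}
print(identify(scores, prune_fraction=0.25))
```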

    Localization and Selection of Speaker Specific Information with Statistical Modeling

    Statistical modeling of the speech signal has been widely used in speaker recognition. The performance obtained with this type of modeling is excellent in laboratories but decreases dramatically for telephone or noisy speech. Moreover, it is difficult to know which pieces of information the system actually takes into account. In order to solve this problem and to improve current systems, a better understanding of the nature of the information used by statistical methods is needed. This knowledge should make it possible to select only the relevant information or to add new sources of information. The first part of this paper presents experiments that aim at localizing the most useful acoustic events for speaker recognition. The relation between discriminant ability and the nature of the speech events is studied. In particular, the phonetic content, the signal stability and the frequency domain are explored. Finally, the potential of the dynamic information contained in the relation between a frame and its p neighbours is investigated. In the second part, the authors propose a new selection procedure designed to select the pertinent features. Conventional feature selection techniques (ascendant selection, knockout) provide only global and a posteriori knowledge about the relevance of an information source. However, some speech clusters may be very efficient for recognizing a particular speaker, whereas they can be non-informative for another one. Moreover, some information classes may be corrupted or even missing under particular recording conditions. This necessity fo
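
    As a minimal sketch of the conventional "knockout" selection mentioned above (the global, a-posteriori technique the authors contrast with their per-speaker procedure, not the new procedure itself), each feature class is removed in turn and the resulting loss of recognition accuracy is taken as its relevance. The `evaluate` callback and the toy evaluator are assumptions for illustration.

```python
def knockout_ranking(feature_classes, evaluate):
    """Rank feature classes by the accuracy lost when each one is removed.
    `evaluate(classes)` is assumed to train/score a speaker recognizer using
    only the given feature classes and to return an accuracy in [0, 1]."""
    baseline = evaluate(feature_classes)
    impact = {}
    for cls in feature_classes:
        reduced = [c for c in feature_classes if c != cls]
        impact[cls] = baseline - evaluate(reduced)   # large drop => relevant class
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)

# toy usage with a dummy evaluator that favours 'cepstral' features
toy_eval = lambda classes: 0.5 + 0.3 * ('cepstral' in classes) + 0.1 * ('delta' in classes)
print(knockout_ranking(['cepstral', 'delta', 'energy'], toy_eval))
```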

    Empirical evaluation of sequence-to-sequence models for word discovery in low-resource settings

    Since Bahdanau et al. [1] first introduced attention for neural machine translation, most sequence-to-sequence models have made use of attention mechanisms [2, 3, 4]. While they produce soft-alignment matrices that can be interpreted as alignments between target and source languages, we lack metrics to quantify their quality, and it remains unclear which approach produces the best alignments. This paper presents an empirical evaluation of 3 of the main sequence-to-sequence models for word discovery from unsegmented phoneme sequences: CNN-, RNN- and Transformer-based. This task consists of aligning word sequences in a source language with phoneme sequences in a target language, and inferring from this alignment a word segmentation on the target side [5]. Evaluating word segmentation quality can be seen as an extrinsic evaluation of the soft-alignment matrices produced during training. Our experiments in a low-resource scenario on the Mboshi and English languages (both aligned to French) show that RNNs surprisingly outperform CNNs and Transformers for this task. Our results are confirmed by an intrinsic evaluation of alignment quality through the use of Average Normalized Entropy (ANE). Lastly, we improve our best word discovery model by using an alignment entropy confidence measure that accumulates ANE over all the occurrences of a given alignment pair in the collection.
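
    The sketch below shows one way to compute an average normalized entropy over a soft-alignment matrix, as a hedged illustration of how an ANE-style confidence can be read off attention weights; the exact definition used in the paper, and how it is accumulated over occurrences of an alignment pair, may differ.

```python
import numpy as np

def normalized_entropy(attention: np.ndarray) -> float:
    """Average normalized entropy of the attention distributions.
    `attention` has shape (target_len, source_len); each row is assumed to be a
    probability distribution over source tokens. Lower values indicate sharper,
    more confident alignments."""
    eps = 1e-12
    row_entropy = -np.sum(attention * np.log(attention + eps), axis=1)
    denom = max(np.log(attention.shape[1]), eps)   # entropy of a uniform row
    return float(np.mean(row_entropy / denom))

# toy usage: a sharp alignment (low value) vs a flat one (value close to 1)
sharp = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
flat = np.full((2, 3), 1.0 / 3.0)
print(normalized_entropy(sharp), normalized_entropy(flat))
```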

    A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

    Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost-zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists to (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language's phonology. We present how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation. Comment: accepted to LREC 201

    A small Griko-Italian speech translation corpus

    This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 2 hours of speech) which have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses. Applying an automatic unit discovery method, pseudo-phones were also generated. We detail how the corpus was collected, cleaned and processed, and we illustrate its use on zero-resource tasks by presenting some baseline results for the tasks of speech-to-translation alignment and unsupervised word discovery. The dataset will be available online, aiming to encourage replicability and diversity in computational language documentation experiments.

    Unsupervised word segmentation from speech with attention

    We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal of automatically identifying lexical units in a low-resource, unwritten language (UL). Our methodology assumes a pairing of recordings in the UL with translations in a well-resourced language. It uses Acoustic Unit Discovery (AUD) to convert speech into a sequence of pseudo-phones that is then segmented using neural soft-alignments produced by a neural machine translation model. Evaluation uses an actual Bantu UL, Mboshi; comparisons to monolingual and bilingual baselines illustrate the potential of attentional word segmentation for language documentation.
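
    One common way to turn soft-alignments into a segmentation is sketched below: a word boundary is placed wherever the most-attended source word changes along the pseudo-phone sequence. This is an illustrative heuristic under assumed inputs (a pseudo-phone list and an attention matrix), not necessarily the exact procedure used in the paper.

```python
import numpy as np

def segment_from_attention(pseudo_phones, attention):
    """Split a pseudo-phone sequence into word-like units using a soft-alignment
    matrix of shape (len(pseudo_phones), source_len): a boundary is inserted
    whenever the most-attended source word index changes."""
    best_source = np.argmax(attention, axis=1)
    words, current = [], [pseudo_phones[0]]
    for phone, prev, cur in zip(pseudo_phones[1:], best_source[:-1], best_source[1:]):
        if cur != prev:
            words.append(current)
            current = []
        current.append(phone)
    words.append(current)
    return words

# toy usage: 5 pseudo-phones attended over 2 source (French) words
phones = ["m", "b", "o", "s", "i"]
att = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])
print(segment_from_attention(phones, att))   # [['m', 'b', 'o'], ['s', 'i']]
```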

    Recent advances in the application of stable isotope ratio analysis in forensic chemistry

    This review paper updates the previous literature on the continued and developing use of stable isotope ratio analysis in samples relevant to forensic science. Recent advances in the analysis of drug samples, explosive materials, and materials of human and animal origin are discussed. The paper also aims to put the use of isotope ratio mass spectrometry into a forensic context and to discuss its evidential potential.

    Investigating the Effect of Emoji in Opinion Classification of Uzbek Movie Review Comments

    Opinion mining on social media posts has become more and more popular. Users often express their opinion on a topic not only with words but also with image symbols such as emoticons and emoji. In this paper, we investigate the effect of emoji-based features in opinion classification of Uzbek texts, and more specifically of movie review comments from YouTube. Several classification algorithms are tested, and feature ranking is performed to evaluate the discriminative ability of the emoji-based features. Comment: 10 pages, 1 figure, 3 tables
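
    As a hedged sketch of what an emoji-based feature pipeline can look like (the comments, labels, regex and classifier below are illustrative assumptions, not the paper's exact setup), emoji are extracted as tokens and fed to a standard bag-of-features classifier whose weights can then be inspected for feature ranking.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Rough, illustrative pattern covering common emoji/emoticon code points.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def emoji_tokens(text: str):
    """Tokenizer that keeps only emoji, so the classifier sees emoji-based features."""
    return EMOJI_RE.findall(text)

# toy Uzbek-style movie comments with sentiment labels (illustrative only)
comments = ["Zo'r film 😍👍", "Yomon 😡", "Juda yaxshi 👍", "Zerikarli 😴"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(tokenizer=emoji_tokens, token_pattern=None),
                    LogisticRegression())
clf.fit(comments, labels)
print(clf.predict(["Ajoyib 😍"]))
```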

    A cross-lingual adaptation approach for rapid development of speech recognizers for learning disabled users

    Building a voice-operated system for learning disabled users is a difficult task that requires a considerable amount of time and effort. Due to the wide spectrum of disabilities and their different related phonopathies, most available approaches are targeted at a specific pathology. This may improve their accuracy for some users, but makes them unsuitable for others. In this paper, we present a cross-lingual approach to adapting a general-purpose modular speech recognizer for learning disabled people. The main advantage of this approach is that it allows rapid and cost-effective development by taking the already built speech recognition engine and its modules, and utilizing existing resources for standard speech in different languages for the recognition of the users' atypical voices. Although the recognizers built with the proposed technique obtain lower accuracy rates than those trained for specific pathologies, they can be used by a wide population and developed more rapidly, which makes it possible to design various types of speech-based applications accessible to learning disabled users. This research was supported by the project 'Favoreciendo la vida autónoma de discapacitados intelectuales con problemas de comunicación oral mediante interfaces personalizados de reconocimiento automático del habla', financed by the Centre of Initiatives for Development Cooperation (Centro de Iniciativas de Cooperación al Desarrollo, CICODE), University of Granada, Spain. This research was also supported by the Student Grant Scheme 2014 (SGS) at the Technical University of Liberec.